ABSTRACT In most preclinical disease models, survival analyses are the gold standard for measuring the efficacy of medical
interventions such as therapeutics or vaccines. In these analyses, treatment regimens that promote the survival and/or reduce the
morbidity of experimental subjects (e.g., mice) are tested for efficacy. Although these analyses appear to be relatively
straightforward, there are associated caveats regarding interpretation of the results that we wish to discuss in this editorial.
Of particular concern is overinterpretation of the biological significance of survival data based on statistical significance
rather than durability of protection.



Statistical significance, most often defined as a P value of <0.05,
simply means that a difference at least as large as the one observed would
occur by chance <5% of the time if the treatment had no effect; it does not necessarily imply
biological significance. In the nonparametric analysis of survival
data, the order of the events rather than the timing of the events is 
the basis for assessing differences between treatment groups. For 
example, if all mice in group 1 die before the mice in group 2, the 
results will be statistically significant regardless of whether the 
group 2 mice die 1 h later or 1 month later or not at all. Thus, in an 
experiment in which all deaths occur within a day or two but 
animals are monitored to determine precise survival times, group 
differences could be statistically significant but not biologically 
relevant. With this in mind, it has become apparent that in many 
recent publications in various highly respected journals, animals 
were monitored for survival multiple times a day or even hourly to 
obtain statistically significant results. These experimental settings 
allowed statistically significant results to be obtained even when 
the differences in survival time were clearly of limited biological 
significance. For example, in many studies, deletion of a particular 
gene led to a 1-day decrease or increase in the median time to 
death with a P value as low as <0.001. Such small differences 
suggest that the deaths were clustered within treatment groups. 
This could happen if all mice in a treatment group, for example, 
were kept in the same cage or were monitored as a group. 

To elaborate on this systematic problem, we generated three sets of hypothetical data that were designed to compare the
efficacies of vaccination for protection against viral challenge. Each data set consisted of two experimental groups
(mock-treated mice and vaccinated mice), and each group consisted of five mice (Fig. 1). For the first set of data, both
groups of mice were monitored once a day and all the mice died 10 days after viral challenge. This yielded a P value of 1,
hence not rejecting the null hypothesis and not indicating any efficacy of vaccination (Fig. 1A). For the second data set,
the same experiment was conducted except that the mice were monitored on an hourly basis. In this case, a difference of 1 h
in median time to death was observed between the two groups (Fig. 1B). In this example, all the mice died around the same
time, but in the mock-treated group, all mice died before any of the mice in the vaccinated group died. This may have
occurred because the vaccine actually afforded a 1-h increase in protection. Alternatively, it could be that all the mice in
the mock-treated group were assessed before the vaccinated mice, so that there was a temporal delay in data gathering. In
either case, the null hypothesis in this instance was clearly rejected (P = 0.0027) and could be used to argue that indeed
the vaccination was effective in providing significant protection. Given the clustering of deaths observed, however, the
results may have been due to nothing more than a cage effect, with no biological significance. For the third data set, a
different vaccine was tested with daily monitoring (Fig. 1C). Although the P values are identical for the vaccination results
in Fig. 1B and C, the vaccination protocol in Fig. 1C had a difference of 100% in survival rates between the mock-treated
group and the vaccinated group, with at least 10 more days of protection than the results shown in Fig. 1B. The P values are
identical because the log rank test considers only the order in which the animals die. Thus, from these hypothetical data
sets, it is apparent that the use of statistical significance in survival analyses could be extremely deceptive. The
biological effect, i.e., duration of protection, clearly needs to be assessed along with the statistical significance of the
data. A mismatch between statistical significance and biological significance could be a red flag for a poorly designed study
that measures something other than treatment efficacy.
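
The order-only behavior of the log rank test can be checked with a short calculation. The following Python sketch implements the standard two-group log rank (Mantel-Cox) statistic; whether GraphPad Prism 4 used exactly this formula is an assumption. The survival times are also assumptions: the editorial does not list the raw data, so the values below (death at day 10 or hour 240 for mock-treated mice, hour 241 for vaccine X, survival to a hypothetical day 20 for vaccine Y) were chosen only to be consistent with the descriptions of Fig. 1A to C.

```python
import math


def log_rank(group_a, group_b):
    """Two-group log rank test.

    Each group is a list of (time, died) pairs; died=False marks an animal
    that was still alive when observation stopped (censored).
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    death_times = sorted({t for t, died in group_a + group_b if died})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in death_times:
        n_a = sum(1 for x, _ in group_a if x >= t)  # at risk in group A
        n_b = sum(1 for x, _ in group_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, died in group_a if died and x == t)
        d_b = sum(1 for x, died in group_b if died and x == t)
        n, d = n_a + n_b, d_a + d_b
        # Deaths observed in group A minus deaths expected under the null.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance, which accounts for tied death times.
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        # No informative time point, e.g. every animal died at the same time.
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical inputs (assumed, not taken from the figure's raw data).
mock_daily = [(10, True)] * 5         # days; all mock-treated mice die on day 10
vaccine_x_daily = [(10, True)] * 5    # days; same outcome with daily checks
mock_hourly = [(240, True)] * 5       # hours; all die at hour 240
vaccine_x_hourly = [(241, True)] * 5  # hours; all die 1 h later
vaccine_y_daily = [(20, False)] * 5   # days; all alive at an assumed day-20 endpoint

for label, a, b in [
    ("A: vaccine X, daily", mock_daily, vaccine_x_daily),
    ("B: vaccine X, hourly", mock_hourly, vaccine_x_hourly),
    ("C: vaccine Y, daily", mock_daily, vaccine_y_daily),
]:
    chi_square, p = log_rank(a, b)
    print(f"{label}: chi-square = {chi_square:.2f}, P = {p:.4f}")
```

Under these assumed inputs, the sketch gives P = 1 for the daily vaccine X data and P of approximately 0.0027 for both the hourly vaccine X data and the vaccine Y data. Replacing hour 241 with any later time, such as hour 960, leaves the statistic unchanged, because only the ordering of deaths between groups enters the calculation.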

FIG 1 Hypothetical data sets illustrating that statistical significance does not necessarily correlate with biological
relevance. Two groups of mice (5 mice per group) were either mock treated or immunized with vaccine X and were challenged
with a lethal dose of virus. Survival of the infected mice was monitored either daily (A) or hourly (B). A different vaccine,
Y, was also tested for efficacy (C). Mice were monitored daily for survival. P values were derived by the Kaplan-Meier log
rank test using GraphPad Prism 4 software. Statistical analyses gave identical P values for panels B and C; however, it is
clear that biologically significant protection was seen only in panel C. PBS, phosphate-buffered saline.

The intent of this article is to serve as a reminder that statistical and biological significance should never be used
interchangeably in survival studies that attempt to predict protective efficacy. In our opinion, for acute infections that
cause death in 7 to 10 days, a 3-day difference in survival is the minimum value that would warrant further development. On
the other hand, a 3-day difference in survival would be biologically irrelevant for chronic infections such as tuberculosis,
in which desired differences would be weeks, months, or even years. Nevertheless, we along with others (1, 2) appreciate that
defining particular quantitative changes as biologically or clinically significant is subjective, context dependent, and
sometimes obscure. We suggest that the biological outcome from the experiment be considered first and then statistics applied
to determine if the results are likely to be due to chance. In this process, it should be remembered that a cutoff P value of
0.05 is relative; a P value of 0.1 indicates that a particular result would occur by chance 10% of the time. This could still
reflect a biologically important effect. It is hoped that these considerations will assist investigators in focusing more on
the most promising outcomes in preclinical disease models and increase the impact of experimental results on development of
effective cures or vaccines in humans.
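
One way to put the biological outcome first is to compute the gain in median survival before looking at any P value. The sketch below is illustrative only: the function names, the median convention (first death time at which Kaplan-Meier survival is 50% or less), and the input times are assumptions, and the 3-day threshold is the opinion stated above for acute infections, not a general rule.

```python
def km_median(group):
    """Kaplan-Meier median: first death time at which survival is <= 50%.

    group is a list of (time_in_days, died) pairs. Returns None when the
    estimate never falls to 50% during follow-up (median not reached).
    """
    survival = 1.0
    for t in sorted({t for t, died in group if died}):
        at_risk = sum(1 for x, _ in group if x >= t)
        deaths = sum(1 for x, died in group if died and x == t)
        survival *= 1 - deaths / at_risk
        if survival <= 0.5:
            return t
    return None


def median_gain(control, treated):
    """Return (gain_in_days, is_lower_bound) for treated versus control."""
    control_median = km_median(control)
    if control_median is None:
        raise ValueError("control median not reached; gain is undefined")
    treated_median = km_median(treated)
    if treated_median is None:
        # Median not reached: the gain is at least the follow-up time minus
        # the control median, so report it as a lower bound.
        last_follow_up = max(t for t, _ in treated)
        return last_follow_up - control_median, True
    return treated_median - control_median, False


MIN_GAIN_DAYS = 3  # suggested minimum for acute infections (context dependent)

# Hypothetical inputs in days (assumed, not the figure's raw data).
mock = [(10.0, True)] * 5
vaccine_x = [(10.0 + 1 / 24, True)] * 5  # all die 1 h after the mock group
vaccine_y = [(20.0, False)] * 5          # all alive at an assumed day-20 endpoint

for label, treated in [("vaccine X", vaccine_x), ("vaccine Y", vaccine_y)]:
    gain, is_lower_bound = median_gain(mock, treated)
    qualifier = "at least " if is_lower_bound else ""
    verdict = "meets" if gain >= MIN_GAIN_DAYS else "is below"
    print(f"{label}: median gain {qualifier}{gain:.2f} days, "
          f"which {verdict} the {MIN_GAIN_DAYS}-day threshold")
```

With these assumed inputs, vaccine X yields a gain of about 0.04 days and vaccine Y a gain of at least 10 days, so the two results separate on effect size even when their P values are identical.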